Session 1a

Peter Prevos

02 February 2022

Data Science for Water Professionals

Part 1

  • Introduction to data science
  • Principles of the R language
  • Analyse water quality data

Part 2

  • Collect and clean and analyse customer data
  • Cluster analysis
  • Linear regression

Part 3

  • Analysing time series
  • Principles of machine learning

Learning Objectives Part 1

  • Apply the principles of strategic data science to solve water problems
  • Write R code to load, analyse, and visualise data
  • Diagnose water quality data with descriptive statistics
  • Develop presentations, reports, and applications to share results

Course Book

peterprevos.com/r4h2o

Program Session 1a

  1. Principles of Data Science
  2. Basics of the R Language
  3. Case Study 0: Open channel flows
  4. Reading CSV and Spreadsheet data
  5. Descriptive Statistics

Introductions

Principles of Data Science

What is Data Science?

Conway Venn Diagram (drewconway.com).

Don’t Believe the Hype

Source: Harvard Business Review.

What is Good Data Science?

Vitrivian triangle (venustas, firmitas, utilitas)

What is Useful Data Science?

Data - Information - Knowledge - Wisdom … Conspiracy

What is Sound Data Science?

Spreadsheets are Chaos

Code is Poetry

reservoirs %>%
    select(Date, River_Flow, Natural_Flow) %>%
    muate(ERV = ifelse(Natural_Flow <= 8, 4, Natural_Flow / 2),
          Date = as.Date(Date, format = "%d/%m/%Y")) %>%
    pivot_longer(-Date,
                 values_to = "Flow",
                 names_to = "Source")  %>%
    ggplot(aes(Date, Flow, col = Source)) +
    geom_line()

Basics of the R Language

Desktop version

  1. Download and install R and RStudio
  2. Download course materials
  3. Unzip folder
  4. In RStudio: File > Open Project
  5. Open r4h2o.Rproj

Cloud Version

  1. Register at cloud.rstudio.com
  2. New Project > New Project From Github Repo
    • https://github.com/pprevos/r4h2o/
  3. RStudio will download the project files

Principles of R

  • Console (REPL: Read-Eval-Print Loop)
  • Mathematical operators (+ - * ^ %% %/%)
  • Variable assignment
  • Assignment operator (<- versus =)
  • Functions (sum(), prod(), abs(), log(a, base = b))
    • Parameters
    • Use = sign
    • Auto-completion

BODMAS

3 - 3 * 6 + 2
## [1] -13

Arithmetic

diameter <- 150
pipe_area <- (pi / 4) * (diameter / 1000)^2
pipe_area
sqrt(pipe_area / (pi / 4)) * 1000

complaints <- c(12, 13, 23, 45, 22, 99, 31)

sum(complaints)
prod(complaints)
abs(complaints)
exp(complaints)
factorial(complaints)
log(complaints, base = 10)

Basic Plotting 1

x <- -10:10
y <- -x^2 -2 * x + 30
plot(x, y, type = "l", col = "blue", main = "Parabola")
abline(h = 0, col = "grey", lty = 2)
abline(v = 0, col = "grey", lty = 2)

Basic Plotting 2

names(complaints) <- month.name[1:7]
barplot(complaints,
        col = "lavender",
        main = "Complaints")

Coding Principles

First computer bug (1947)

Using the Editor

Natural Language

  • Words
    • Nouns (variables - meaningful
      • _ or . or camelCase
    • Verbs (functions)
  • Phrases (code)
  • Context

Bugs

  • Typos (lintr)
  • Error messes (copy and paste into search engine)
  • Help files

Scripts and Projects

  • Working Directory
  • Open Script
  • Commenting
  • Indentation

Case Study 0: Channel Flow

Discharge formula

\[Q = \frac{2}{3} C_d \sqrt{2g} \; lh^\frac{3}{2}\]

  • \(Q\): Discharge in m3/s
  • \(C_d\): Discharge coefficient
  • \(g\): Acceleration of gravity (9.81 m/s2)
  • \(l\): Crest length [m]
  • h: Head above crest level [m]

Coding Practice

Create an R script and answer:

  1. What is the flow in the channel in m3/s when the height \(h = 100\) mm?
  2. What is the average flow for these three heights: 150mm, 136mm, 75mm, in litres per second?
  3. Visualise the flow in m3/s for all heights between 50mm and 500mm

\(Q = \frac{2}{3} C_d \sqrt{2g} \; lh^\frac{3}{2}\)

  • \(Q\): Discharge in m3/s
  • \(C_d\): Discharge coefficient (\(C_d=0.62\))
  • \(g\): Acceleration of gravity (9.81 m/s2)
  • \(l\): Crest width (\(l=0.5\) m)
  • \(h\): Head above crest level [m]

Exploring Data with Tidyverse

Base R functions

  • Basic programming functions
  • Arithmetic
  • Statistics
  • Plotting
  • Extendable with functions

R Packages

  • Packages to extend base functionality
    • Distributed mainly through CRAN
    • Comprehensive R Archive Network
  • Call library to access functions: library(dplyr)
  • Or, add library name and two colons: dplyr::filter()

Tidyverse

  • Collection of R packages
  • Easy data manipulation
  • ‘syntactic sugar’

Case Study 1: Water Quality

Obtaining Data

  • Database (SQL, Oracle)
  • Desktop files (spreadsheets, CSV)
  • Web (HTML, XML, JSON)
    • API
    • Scraping

CSV files

  • Use only the top row as a header
  • Don’t use colours to indicate values
  • Prevent using spaces in column names
  • Don’t add any spreadsheet calculations
  • Every cell should be a data point or remain empty

Reading CSV files

  • readr package for CSV files (part of Tidyverse)
    • read_csv() faster alternative for read.csv()
    • Look at help text
  • File names in R
    • Unix-based (slash instead of backslash)
    • Results in data frame / tibble
library(tidyverse)
gormsey <- read_csv("casestudy1/gormsey.csv")

names(gormsey)
dim(gormsey)
ncol(gormsey)
nrow(gormsey)

head(gormsey)
str(gormsey)
glimpse(gormsey)

Variable classes

  • Integer (\({\ldots -3, -2, -1, 0, 1, 2, 3, \ldots}\))
  • Numeric (Real numbers) (\(\mathbb{R}\))
  • Complex numbers (\(a + bi\))
  • Text ("abcd")
  • Dates ("2022-02-02")
  • Factors (classifications): ("Male", "Female", "Other")
  • Boolean (TRUE, FALSE)

Variable Types

Scalar, vector and data frame / tibble (matrix)

Exploring data frames

  • df[rows, columns]
  • df$column
  • glimpse(df)
gormsey[12:13, ]
gormsey[, 4:5]
gormsey[1:2, c(2, 4:5, 6)]
gormsey[1:2, c(-1, -3, -7)]
gormsey$Date[1:6]

What is the sample number of the last sample in the Gormsey data?

Hint, use the nrow() function.

Conditionals

Compare variables

a <- 1:2
a == 1
a != 1
a >= 2

a <- c(TRUE, FALSE)
a * 2

"small" > "large" 

gormsey$Measure == "Turbidity" & gormsey$Result > 5

Filtering

# Three methods

gormsey[gormsey$Measure == "Turbidity", ]

subset(gormsey, Measure == "Turbidity")

library(dplyr)
filter(gormsey, Measure == "Turbidity")

turb5 <- filter(gormsey, Measure == "Turbidity" & Result > 5)

Counting Results

## [1] 2422
## [1] "Chlorine Total" "E. coli"        "Turbidity"      "THM"
## 
## Chlorine Total        E. coli            THM      Turbidity 
##            760            760            168            734
## # A tibble: 4 × 2
##   Measure            n
##   <chr>          <int>
## 1 Chlorine Total   760
## 2 E. coli          760
## 3 THM              168
## 4 Turbidity        734

Coding Practice

Write a script to answer these questions:

  1. How many turbidity results in all Towns, except Strathmore, are lower than 0.1 NTU?
  2. Which sample points had a non-zero E. coli count?
  3. Create a table that counts the number of THM samples per town.

Descriptive Statistics

Central Tendency

turbidity <- filter(gormsey, Measure == "Turbidity")

mean(turbidity$Result)
sum(turbidity$Result) / length(turbidity$Result)

median(turbidity$Result)

which.max(table(round(turbidity$Result, 1)))

Calculating the mode of continuous data is complex and requires a specialised package.

Dispersion

min(turbidity$Result)
max(turbidity$Result)
range(turbidity$Result)
diff(range(turbidity$Result))

var(turbidity$Result)
sd(turbidity$Result)

sqrt(sum((turbidity$Result - mean(turbidity$Result))^2) / (length(turbidity$Result) - 1))

Quantiles

  1. Place the observations in ascending order: \(y_1, y_2, \ldots y_n\).
  2. Calculate the rank (\(r\)) of the required percentile
  3. Interpolate between adjacent numbers: \[y_r = y_{\lfloor r \rfloor}+ (y_{r_{\lceil r \rceil}} - y_{\lfloor r \rfloor})(r - \lfloor r \rfloor)\]
    • \(y\): Observation
    • \(r\): Ranking number
    • \(\lfloor r \rfloor\): Floor or \(r\)
    • \(\lceil r \rceil\): Ceiling of \(r\)

IQR(turbidity$Result)
summary(turbidity$Result)
quantile(turbidity$Result, type = 6, probs = 0.95)

Quantiles (continued)

Grouped Analysis

measures <- group_by(gormsey, Measure)
summarise(measures, max = max(Result))

measures_town <- group_by(gormsey, Measure, Town)
summarise(measures,
          Samples = n(),
          Mean = mean(Result),
          Median = median(Result),
          p95 = quantile(Result, type = 6, probs = 0.95))

Coding Practice

  1. What is the average number of samples taken at the sample points in Gormesey?
  2. Which sample point has breached the maximum value of 0.25 mg/l for THM most often?
  3. What is the highest 95th percentile of the turbidity for each of the towns in Gormsey, using the Weibull method?

Next Session

  1. Visualise data
  2. ggplot2 package
  3. Data science workflow
  4. Presenting results